Abstract

Adaptive behavior depends less on the details of the negotiation process and yields more robust predictions in the long
term than in the short term. However, the extant literature on population dynamics for behavior adjustment has
only examined the current situation. To offset this limitation, we propose combining an evolutionary algorithm with
reinforcement learning to investigate long-term collective performance and strategy evolution. The model adopts
reinforcement learning with a tradeoff between historical and current information to make decisions when the strategies of
agents evolve through repeated interactions. The results demonstrate that the strategies in populations converge to stable
states, and the agents gradually form steady negotiation habits. Agents that adopt reinforcement learning perform better in
payoff, fairness, and stability than their counterparts using a classic evolutionary algorithm.
Introduction

Uncertainty in negotiation can be caused by a number of 
factors, such as fuzzy opponent type, unknown strategies, and 
deadlines [1]. Hence, one of the central topics in negotiation is 
how to design agents with higher adaptability for changing 
circumstances [2,3]. In the environment of incomplete information,
the primary concern is how to acquire more information and
use it appropriately to reach consensus through concession in 
negotiation [4-6]. Through feedback, the agents learn how to 
make future decisions. Because the learning effect in the short term 
depends strongly on the details of the negotiation process and the 
characteristics of the opponent, it is difficult to evaluate a learning 
model through sparse interactions. In addition, because heterogeneous
agents behave differently, it is reasonable to assess the
performance from a macroscopic view [7,8].

To address the uncertainty, learning in the long term 
emphasizes strategy selection and adjustment through repeated 
negotiations using qualitative or quantitative methods. For 
instance, Eduard Giménez-Funes et al. have applied a qualitative
approach, case-based reasoning, to find the appropriate strategy
by comparing the similarity of current situations to history [9].
Matos et al. have applied a widely used quantitative model, an 
evolutionary algorithm, to investigate long-term behavior [10]. 
This approach is derived from biology, with simple rules to 
evaluate payoff [11]. A number of researchers have investigated 
strategy evolution in the long term in recent years [12-17]. In the 
genetic algorithm, agents with higher fitness are passively selected 
and put into the mating pool to replicate the next generation, 
without considering the learning behavior of the agents. This 
approach myopically assesses the performance of agents with 
payoff only in the current period, while humans learn by weighing 
both the historical information and the current performance. If the
agent represents a person in reality, it is reasonable to incorporate
individual learning because humans actively adjust their strategies
through experience.

Fudenberg and Levine have investigated long-term strategy 
dynamics, including replicator dynamics and reinforcement 
learning [18]. Because reinforcement learning is a type of 
individual learning while the evolutionary approach concerns 
population dynamics, prior literature has generally examined them 
separately. However, Börgers and Sarin [19] find that a
type of continuous-time reinforcement learning can converge to an
equilibrium of replicator dynamics, which indicates some interaction
between population dynamics and individual reinforcement
learning. Reinforcement learning, with its basis in psychology
[20,21], evaluates the reward by weighting historical and current 
payoffs and has been applied to human strategy adjustment as well 
as to artificial intelligence [22-24]. Recently, researchers have 
begun to integrate different learning approaches to determine an 
agent's optimal strategy in the case of incomplete information. 
Reinforcement learning is a good fit when information on the 
opponent and environment is limited. To this end, agents in our 
model adopt reinforcement learning to calculate the reward of 
each strategy and then use replicator dynamics to adjust the 
probability of strategies. We integrate replicator dynamics and 
reinforcement learning to explore the efficiency, fairness, and 
strategy convergence in negotiation. In addition to the efficiency 
and strategy evolution, fairness has also been a concern of many 
researchers [25,26]. The simulation results indicate that our 
approach achieves higher reward, shorter negotiation time, and a 
lower degree of greediness of strategies than the classic evolution 
model. We also show that the weighting of current against
historical experience strongly affects the negotiation performance and
the learning effect.

Methods 

This model is based on the alternating-offers mechanism of
Rubinstein, in which agents adopt the time-dependent concession
function [10]. The agents use reinforcement learning to accumulate
experience and update the probability of each strategy using
replicator dynamics. The negotiation process is illustrated in
Figure 1.

Negotiation rules 

The reservation prices $R_s$ and $R_b$ of the sellers and buyers are
uniformly distributed between $P_{min}$ and $P_{max}$. In each period,
seller-buyer pairs are randomly selected and begin to negotiate if
$R_s < R_b$. The offering intervals for a buyer and a seller are
$(P_{min}, R_b)$ and $(R_s, P_{max})$, respectively. They alternate in making
offers, and the discount rates of sellers and buyers are $C_s$ and $C_b$,
respectively (the first offer is proposed by the buyer by default). If
an agent rejects the offer of the opponent, he then proposes his
own offer ($P_s$ for the seller and $P_b$ for the buyer), and the accepted
price is $P$. The payoff of the seller is $U_s = (P - R_s) C_s^t$, and the
payoff of the buyer is $U_b = (R_b - P) C_b^t$, where $t$ is the negotiation
time.
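
As a minimal sketch of the payoff rule in this subsection, the following Python snippet evaluates the discounted payoffs U_s = (P - R_s) * C_s^t and U_b = (R_b - P) * C_b^t, reading the discount rate as raised to the power of the negotiation time t. The function names and the numerical values are illustrative assumptions and are not taken from the experiments.

```python
def can_negotiate(reserve_s, reserve_b):
    """A seller-buyer pair negotiates only if R_s < R_b."""
    return reserve_s < reserve_b


def seller_payoff(price, reserve_s, discount_s, t):
    """Seller payoff U_s = (P - R_s) * C_s**t."""
    return (price - reserve_s) * discount_s ** t


def buyer_payoff(price, reserve_b, discount_b, t):
    """Buyer payoff U_b = (R_b - P) * C_b**t."""
    return (reserve_b - price) * discount_b ** t


# Hypothetical deal: agreement at price 60 after t = 2 rounds,
# with R_s = 40, R_b = 90 and both discount rates equal to 0.9.
us = seller_payoff(price=60.0, reserve_s=40.0, discount_s=0.9, t=2)
ub = buyer_payoff(price=60.0, reserve_b=90.0, discount_b=0.9, t=2)
print(can_negotiate(40.0, 90.0), us, ub)
```

Because both discount rates lie below 1 in the experiments, every additional round shrinks both payoffs, which is why long negotiations are costly for both sides.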

Concession functions 

The time-dependent strategies for buyers and sellers are $P_b(t)$
and $P_s(t)$, respectively, and the concession rate increases with
time. $P_b(t)$ and $P_s(t)$ are represented in equation (1) as follows:
Figure 1. The negotiation model. The circles denote buyers; triangles denote sellers, and ellipses denote environmental factors. The values
$x_1, x_2, x_3$ are probabilities of the frugal, cool-headed and anxious strategies of buyers, respectively, and $y_1, y_2, y_3$ are the corresponding probabilities for
the sellers. The shaded area represents the selected strategy in the current period. At the end of each negotiation period, the buyers (sellers) calculate
the reward through reinforcement learning and update $x_1, x_2, x_3$ ($y_1, y_2, y_3$) accordingly. The process continues until the strategies of all agents in the
market converge to a stable state.
doi:10.1371/journal.pone.0102840.g001
where $g$ denotes the time pressure of negotiators, and $k_b^i, k_s^i$ ($0 < k_b^i, k_s^i < 1$)
denote the greediness of buyers and sellers, with smaller $k_b^i$
and $k_s^i$ values indicating higher levels of greediness. Three
strategies with different degrees of greediness exist for each
population: $0 < k_b^1 < k_b^2 < k_b^3 < 1$ and $0 < k_s^1 < k_s^2 < k_s^3 < 1$. Here, $k^1$
represents the frugal strategy; $k^2$ represents the cool-headed strategy;
and $k^3$ represents the anxious strategy for the buyers as well as for
the sellers. $A$ refers to the initial offer parameter, where a larger $A$
creates an offer closer to the reservation price; $\lambda$ refers to the
concession type, which is defined as a convex function for buyers
and a concave function for sellers when $\lambda > 1$. The buyer and the
seller propose offers alternately until an offer is accepted. An agent
accepts when the offer yields a higher payoff than rejecting it and
proposing his own offer, which is expected to be accepted by the
opponent in the next round. The buyer accepts an offer $P_s^t$ when
$R_b - P_s^t > C_b (R_b - P_b^{t+1})$, and a seller accepts $P_b^t$ when
$P_b^t - R_s > C_s (P_s^{t+1} - R_s)$.
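
The acceptance rule can be sketched in Python as follows. The snippet assumes the reading that an agent compares the undiscounted surplus of the offer on the table with the surplus of his own next offer discounted by one round; the function names and numbers are hypothetical.

```python
def buyer_accepts(seller_offer, next_own_offer, reserve_b, discount_b):
    """Buyer accepts when R_b - P_s(t) > C_b * (R_b - P_b(t+1))."""
    return reserve_b - seller_offer > discount_b * (reserve_b - next_own_offer)


def seller_accepts(buyer_offer, next_own_offer, reserve_s, discount_s):
    """Seller accepts when P_b(t) - R_s > C_s * (P_s(t+1) - R_s)."""
    return buyer_offer - reserve_s > discount_s * (next_own_offer - reserve_s)


# Hypothetical buyer: R_b = 90, seller asks 60, the buyer's next offer would be 55.
patient_buyer = buyer_accepts(60.0, 55.0, reserve_b=90.0, discount_b=0.9)
impatient_buyer = buyer_accepts(60.0, 55.0, reserve_b=90.0, discount_b=0.8)

# Hypothetical seller: R_s = 40, buyer bids 60, the seller's next offer would be 65.
patient_seller = seller_accepts(60.0, 65.0, reserve_s=40.0, discount_s=0.9)
impatient_seller = seller_accepts(60.0, 65.0, reserve_s=40.0, discount_s=0.7)
print(patient_buyer, impatient_buyer, patient_seller, impatient_seller)
```

In this example the same offer is rejected at a discount rate of 0.9 but accepted at a lower one, since waiting another round costs the agent more.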

Learning rules 

The agents update their rewards according to feedback based 
on historical information and eventually develop a negotiation 
habit [27]. Due to the constantly changing environment, the 
reference value of information decreases with time, and thus, the 
agents assign different weights to historical and current payoffs. 
The reward function is defined as follows: 



$$u_t^i = w \, u_{t-1}^i + (1 - w) \, v_t^i, \qquad (2)$$



where $u_t^i$ is the average reward of strategy $k^i$ in period $t$, $w$ is the
weight of the historical payoff, and $v_t^i$ is the average current payoff
of strategy $k^i$. At first, each agent has a subjective probability for
every strategy and thereby chooses one strategy to negotiate. In
period $t$, the agent chooses strategy $k^1$ with probability $x_t^1$, $k^2$
with $x_t^2$, and $k^3$ with $x_t^3$, where $x_t^1 + x_t^2 + x_t^3 = 1$. The agent updates the
probabilities according to the rewards of each strategy, where a
higher reward leads to an increased probability in the next period
and vice versa. The agents adjust their strategies each time the total
number of negotiations reaches $N$; this interval is defined as a learning period.
The adjustment follows replicator dynamics, slightly
modified to fit reality:



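$$
f_{t+1}^i =
\begin{cases}
\dfrac{u_t^i}{\bar{u}_t} \, x_t^i, & \text{if } |x(t+1) - x(t)| > s \\
x_t^i + s, & \text{if } 0 < x(t+1) - x(t) < s \\
x_t^i - s, & \text{if } -s < x(t+1) - x(t) < 0,
\end{cases}
\qquad (3)
$$

$$
x_{t+1}^i = \frac{f_{t+1}^i}{\sum_{k=1}^{3} f_{t+1}^k} \qquad (4)
$$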



where $\bar{u}_t = \sum_{i=1}^{3} x_t^i u_t^i$ is the average payoff, and $u_t^i$ is the payoff of
strategy $k^i$ in period $t$. In Eq. (3), $f_{t+1}^i$ is the temporary probability
of strategy $i$, and $\sum_{i=1}^{3} f_{t+1}^i$ does not necessarily equal 1. In Eq. (4),
$x_{t+1}^i$ is the normalized result, which ensures $\sum_{i=1}^{3} x_{t+1}^i = 1$. In
period $t+1$, the probability of strategy $k^i$ is proportional to $u_t^i / \bar{u}_t$,
which means that the probability increases in the next period
when $u_t^i$ is more than $\bar{u}_t$ and decreases in the opposite case. We
define the adjustment precision as $s$ with the default value of 0.01.
When the change in probability is smaller than $s$
($|x(t+1) - x(t)| < s$), the adjustment size is $s$. When $x_t^i < 0.01$, we
set $x_t^i = 0$, which means that this strategy has vanished. When
$x_t^i > 0.99$, we set $x_t^i = 1$, which means that the agent has converged
to this strategy. In addition, US denotes the payoff of the seller
without learning, and US* denotes the payoff with learning.
Similarly, UB and UB* denote the corresponding payoffs for
the buyer. Fairness without learning is evaluated by UB/US, and
fairness with learning is evaluated by UB*/US*.
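
The learning rules can be illustrated with a short Python sketch for a single agent. It assumes that the reward of Eq. (2) blends the previous reward and the current average payoff with weight w, that the probability change tested in Eq. (3) is the change a plain replicator step would produce, and that the mean reward is positive. Several details are not fixed by the text and are choices made only for this sketch: how to treat a change exactly equal to s, the clamp at zero, and the re-normalization after the 0.01 and 0.99 cut-offs. All function names are hypothetical.

```python
def update_rewards(prev_rewards, current_payoffs, w=0.9):
    """Eq. (2): u_t = w * u_{t-1} + (1 - w) * v_t for every strategy."""
    return [w * u + (1.0 - w) * v for u, v in zip(prev_rewards, current_payoffs)]


def update_probabilities(x, u, s=0.01, cutoff=0.01):
    """Eqs. (3)-(4) followed by the vanish/converge cut-offs (sketch)."""
    u_bar = sum(xi * ui for xi, ui in zip(x, u))  # average payoff
    f = []
    for xi, ui in zip(x, u):
        proposed = xi * ui / u_bar       # plain replicator step
        change = proposed - xi
        if abs(change) >= s:             # large change: keep the replicator value
            f.append(proposed)
        elif change > 0:                 # small increase: step up by s
            f.append(xi + s)
        elif change < 0:                 # small decrease: step down by s (assumed clamp)
            f.append(max(xi - s, 0.0))
        else:                            # no change, e.g. a vanished strategy
            f.append(xi)
    total = sum(f)
    x_new = [fi / total for fi in f]     # Eq. (4): normalize to sum to 1
    # Cut-offs: below 0.01 the strategy vanishes, above 0.99 the agent has converged.
    x_new = [0.0 if p < cutoff else 1.0 if p > 1.0 - cutoff else p for p in x_new]
    total = sum(x_new)                   # assumed re-normalization after the cut-offs
    return [p / total for p in x_new]


# Hypothetical rewards and payoffs for the frugal, cool-headed and anxious strategies.
rewards = update_rewards([1.0, 2.0, 1.5], [1.2, 2.4, 1.0], w=0.9)
probs = update_probabilities([0.3, 0.4, 0.3], rewards)
small = update_probabilities([0.5, 0.5, 0.0], [1.0, 1.01, 1.0])
pure = update_probabilities([0.02, 0.97, 0.01], [1.0, 3.0, 1.0])
print(rewards, probs, small, pure)
```

Unrolling Eq. (2) shows that a payoff observed j periods ago enters the reward with weight (1 - w) * w^j, so a larger w gives the agent a longer effective memory.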



Experiments 

The first experiment investigates the general performance of the 
agents when $g = 6$, $k^1 = 0.3$, $k^2 = 0.5$, $k^3 = 0.7$, and $w = 0.9$. The
buyer proposes the initial offer, and the seller continues to
negotiate, with the discount rates $C_s$ and $C_b$ changing within the
range of 0.5-1 and a minimum adjustment size of 0.025. The
parameters in the control group without learning are the same as 
in the above-mentioned group except that agents in the control 
group do not adjust the probabilities of strategies. 

The second experiment explores the impact of the weights of 
historical information on the negotiation result and strategy 
convergence. The weight $w$ ranges from 0.2 to 0.9 in steps of
0.1. Other parameters are the same as in the first experiment. We
observe the negotiation result and calculate the related variables
using the formulas in Table S1.
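
The parameter ranges of the two experiments can be enumerated with a small Python helper. This is only a sketch: the helper name `grid` is hypothetical, and the text does not state whether the seller and buyer discount rates are varied jointly or independently, so the snippet builds the value lists without pairing them.

```python
def grid(start, stop, step):
    """Evenly spaced values from start to stop inclusive, rounded to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


# Experiment 1: discount rates from 0.5 to 1 with an adjustment size of 0.025.
discount_rates = grid(0.5, 1.0, 0.025)

# Experiment 2: weights of historical information from 0.2 to 0.9 in steps of 0.1.
weights = grid(0.2, 0.9, 0.1)

print(len(discount_rates), len(weights))
```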

Results 

Performance of the negotiation agents 

The positive growth rate of the payoff in most cases means that 
the learning approach is beneficial to both buyers and sellers. As 
the discount rate decreases (see Figure 2A), the growth rate 
gradually increases. The reason may be that a lower discount rate 
encourages the intention of accepting an offer. The negotiation 
time is stable and only varies slightly with the change in discount
rate, so it is not illustrated in Figure 2. Figure 2B illustrates the
fairness with and without learning, where the noticeable difference
between the two indicates that learning has changed the division of
payoff between buyers and sellers. The difference in the growth
rate of payoffs in Figure 2A between the two populations is also the
major reason for the variation of fairness with learning. Initialized
with random subjective strategy probabilities, most agents finally
converge to a cool-headed strategy. Only a small portion of agents
converge to an anxious strategy or frugal strategy after periods of
evolution. While the convergence results of the two populations
are somewhat different, the patterns of strategy distribution are
similar. The result shows that learning has changed the habit of
strategy adoption, and most agents become sensible by using a
cool-headed strategy despite lacking emotion.

Figure 2. Negotiation performance of the agents. Figure 2A illustrates the growth rate of the average payoff with learning for buyers, sellers,
and joint payoff compared to the growth rate without learning. Figure 2B illustrates the fairness comparison between the group with learning and
the group without learning. Figure 2C shows the number of agents of each strategy in the convergence results, as every agent converges ultimately
to a pure strategy.
doi:10.1371/journal.pone.0102840.g002

Impact of weight of historical information 

In Figure 3A, the joint payoff of buyers and sellers changes only 
slightly when the weight increases from 0.5 to 0.9. As the weight 
drops below 0.5, the payoff declines sharply with the decreasing 
discount rate and reaches a minimum at approximately 0.8. It is 
obvious that the agents who emphasize historical experience 
perform better than the agents who ignore it. In Figure 3B, the 
negotiation time remains stable when the weight is above 0.5 and
rises markedly once the weight drops below 0.5. The discount rate
in our model indicates that more rounds of negotiation lead to 
lower payoff and inefficiency of the market. In Figure 3C, there is 
only a minor fluctuation of fairness when the weight rises above 
0.5, and fluctuation becomes noticeable when the weight drops 
below 0.5. It can be concluded that the profit division of the 
market becomes unfair if the agents rarely consider prior 
information. 



To summarize, agents who value long-term experience achieve
more stable and efficient performance in terms of joint payoff,
negotiation time, and fairness.

Figures 4A, 4B and 4C illustrate the impacts of the weight of 
historical information on buyers. The figures demonstrate that the 
convergence strategies remain stable when the weight is above 0.4 
but vary significantly when the weight drops below 0.4. When the
weight is high, the majority of the market adopts the cool-headed
strategy, with the anxious and the frugal strategies as minorities. 
When the weight decreases, the advantage of the cool-headed 
strategy declines, while the frugal strategy increases and the 
anxious one remains stable. The results only change slightly with 
decreasing discount rate when the weight is high. In contrast, the 
results fluctuate substantially as the weight reaches a low level. 

Figures 4D, 4E and 4F illustrate the impacts of the weight of 
historical information on the convergence of the frugal, cool- 
headed, and anxious strategies of the sellers, respectively. The 
general trend shows that the greediness of the agents rises notably 
when the weight decreases. Specifically, we find that the frugal 
strategy increases, while the cool-headed strategy simultaneously 
decreases. The frugal strategy is the minority when the weight is 
high and rises gradually when the weight drops, and it proves to be 
dominant at a high discount rate when the weight drops to 
approximately 0.2 (see Figure 4D). The cool-headed strategy shows
the opposite trend to the frugal one and declines as the weight
decreases (see Figure 4E). As the discount rate decreases, the frugal
strategy decreases, and the cool-headed strategy increases. The
cool-headed strategy becomes dominant when the discount rate is
very low, while the frugal strategy is a tiny minority. This result
suggests that a low discount rate makes agents less greedy, possibly
because time pressure pushes them to accept offers more
readily. The anxious strategy remains at a low level when the
weight is high and fluctuates substantially when the weight is
below 0.5 (see Figure 4F). In conclusion, the strategy convergence
remains stable and the cool-headed strategy has a noticeable
advantage when the weight of historical information is high. The
greediness of the whole market rises as the weight drops.

Figure 3. Impact of historical weight on payoff, time and fairness. Figures 3A, 3B and 3C illustrate the change in joint payoff, negotiation
time, and fairness with different weights, compared with a weight of 0.8 as the baseline. Because the performance of agents with weights of 0.8 and
0.9 is very similar, the results for a weight of 0.9 are not displayed in Figure 3. To concisely demonstrate the result, we select the representative data
with weights between 0.2 and 0.7 based on the benchmark of 0.8.
doi:10.1371/journal.pone.0102840.g003

Figure 4. Impact of historical weight on the strategy convergence. Figures 4A, 4B and 4C illustrate the evolution results of the frugal, cool-headed,
and anxious strategies of buyers using different weights. The Y axis represents the number of agents using each strategy in the population.
Similarly, Figures 4D, 4E and 4F represent the corresponding results for sellers.
doi:10.1371/journal.pone.0102840.g004

To summarize, the impacts of weight on convergence strategies 
are different between buyers and sellers. The strategies of sellers 
vary gradually with the weight, while the strategies of the buyers 
remain relatively stable when the weight is above 0.5 and change 
rapidly when the weight falls below 0.5. Because the only 
distinction between the two populations is the order of proposing 
the first offer, the asymmetry may arise from this factor. Therefore, 
the offering order not only affects the division of the payoff 
between buyers and sellers but also gives rise to the difference in 
the strategy distribution between them. The agents update their 
strategy probability after a period of negotiation during which the 
costs of agents remain stable, and therefore the frugal strategy 
achieves a higher payoff. The agents use a more greedy strategy 
when the weight is lower. Our results show the following: (1)
myopic adjustment leads to a more frugal strategy with less
concession, and (2) the algorithm with reinforcement learning
results in a more cool-headed strategy with more efficient
performance in the overall population, such as less negotiation
time and less fluctuation in fairness.

Discussion 

The evolutionary approach is effective in investigating the 
collective behavior of the population in the long term. However, 
human learning involves more initiative than biological evolution, 
and therefore, we have integrated reinforcement learning with 
replicator dynamics to investigate negotiation behavior. Negotiation
strategies are generally complex, and a new strategy type is
created by design instead of mutation. In the genetic algorithm, 
the population of each generation is created by passive selection 
and reproduction [28-30], but agents representing humans usually 
will not depart the market even if they suffer from occasional loss 
in negotiation. Rather, they make decisions using initiative and 
accumulate experience through multiple periods of negotiation. 
To this end, this model evaluates the rewards of strategies by 
assigning weights to historical payoffs as well as to current ones. As 
a result, our learning pattern incorporating replicator dynamics 
differs from classic reinforcement learning, which determines the 
probability of strategies in proportion to the rewards [31]. 
Reinforcement learning has many models that differ from each 
other in details such as the probability determination rules. Rajiv 
Sarin and Farshid Vahid design a simple reinforcement learning 
model without probability, in which the agents choose the strategy
with the highest reward instead of through subjective probability
[32]. Börgers and Sarin argue that individual learning is a process of idea
evolution as well as habit formation and prove that a continuous-time
reinforcement learning of an individual converges to an
equilibrium of replicator dynamics [19]. We adopt replicator
dynamics as the strategy adjustment rule and incorporate 
reinforcement learning into the payoff evaluation. The simulation 
results suggest that agents using this new learning model achieve 
higher payoff, shorter negotiation time, and more stable fairness 
than agents using the classic evolutionary approach. 

This paper presents two experiments designed to study the 
general performance of agents and the impacts of the weight of 
historical information on the negotiation result and convergence 
strategy. The results indicate that in most cases, learning increases 
the payoff of both buyers and sellers. Thus, the learning pattern is 
beneficial to both sides, and the growth rate rises with decreasing 
discount rate. Ravindra Krovi et al. [33] compare the payoff and 
fairness when one or two variables are controlled. However, they 
examine the evolution of offer instead of strategy, which is 
different from this paper. We evaluate the learning effect from the 
perspective of the population instead of the individual agent. 

In addition to the market efficiency and fairness, we also 
consider long-term strategy evolution. In our model, all the agents 
converge to pure strategy and form stable habits. Although 
heterogeneous agents have different reservation values and initial 
states regarding strategies, the strategy distribution of the whole 
market is relatively stable, and the convergent results vary slightly
with the initial settings. Noyda Matos et al. [10] allow the agents to 
use mixed strategies, but the proportion of each strategy is similar, 
and there is no dominant strategy. In our model, the majority is 
the cool-headed strategy, which means that the agents become 
rational by learning in the long run, although we do not consider 
psychological factors. 